Privacy Preserving Statistical Analysis on Distributed Databases

ABSTRACT

Aggregate statistics are securely determined on private data by first sampling independent first and second data at one or more clients to obtain sampled data, wherein a sampling parameter is substantially smaller than a length of the data. The sampled data are encrypted to obtain encrypted data, which are then combined. The combined encrypted data are randomized to obtain randomized data. At an authorized third-party processor, a joint distribution of the first and second data is estimated from the randomized encrypted data, such that a differential privacy requirement of the first and second data is satisfied.

RELATED APPLICATION

U.S. patent application Ser. No. 13/676,528, “Privacy Preserving Statistical Analysis for Distributed Databases,” filed Nov. 14, 2012 by Wang et al., incorporated herein by reference. That Application determines aggregate statistics by randomizing data from two sources for a server, which computes an empirical distribution, specifically a conditional and joint distribution of the data, while maintaining privacy of the data.

FIELD OF THE INVENTION

This invention relates generally to secure computing, and more particularly to performing secure aggregate statistical analysis on private data by a third party.

BACKGROUND OF THE INVENTION

One of the most noticeable technological trends is the emergence and proliferation of large-scale distributed databases. Public and private enterprises are collecting tremendous amounts of data on individuals, their activities, their preferences, their locations, spending habits, medical and financial histories, and so on. These enterprises include government organizations, health providers, financial institutions, Internet search engines, social networks, cloud service providers, and many others. Naturally, interested parties could potentially discern meaningful patterns and gain valuable insights if they were able to access and correlate the information across the databases.

For example, a researcher may want to determine the correlations between individual income and personal characteristics such as gender, race, age, education, etc., or a medical researcher may want to study the relationships between disease prevalence and individual environmental factors. In such applications, it is imperative to maintain the privacy of individuals, while ensuring that the useful aggregate statistical information is only revealed to the authorized parties. Indeed, unless the public is satisfied that their privacy is being preserved, they would not provide their consent for the collection and use of their personal information. Additionally, the inherent distribution of this data across multiple databases presents a significant challenge, as privacy concerns and policy would likely prevent direct sharing of data to facilitate statistical analysis in a centralized fashion. Thus, tools must be developed for performing statistical analysis on large and distributed databases, while addressing these privacy and policy concerns.

It is known that conventional mechanisms for privacy, such as k-anonymization, do not provide adequate privacy. Specifically, an informed adversary can link an arbitrary amount of side information to an anonymized database, and defeat the anonymization mechanism. In response to vulnerabilities of simple anonymization mechanisms, a stricter notion of privacy, known as differential privacy, has been developed. Informally, differential privacy ensures that the result of a function computed on a database of respondents is almost insensitive to the presence or absence of a particular respondent. More formally, when the function is evaluated on databases differing in only one respondent, the probability of outputting the same result is almost unchanged.

Mechanisms that provide differential privacy typically involve output perturbation, e.g., when Laplacian noise is added to the result of a function computed on a database, the noise provides differential privacy to the individual respondents in the database. Nevertheless, it can be shown that input perturbation approaches, such as the randomized response mechanism, where noise is added to the data, also provide differential privacy to the respondents.

It is desired to protect the privacy of individual respondents in a database, to prevent unauthorized parties from computing joint or marginal empirical probability distributions of the data, and to achieve a superior tradeoff between privacy and utility compared to simply performing post randomization (PRAM) on the database.

Sampling can be used for crowd-blending privacy. This is a strictly relaxed version of differential privacy, but it is known that a pre-sampling step applied to a crowd-blending privacy mechanism can achieve a desired amount of differential privacy.

The related application Ser. No. 13/676,528 first independently randomizes data $X$ and $Y$ to obtain randomized data $\hat{X}$ and $\hat{Y}$. The first randomizing preserves the privacy of the data $X$ and $Y$. Then, the randomized data $\hat{X}$ and $\hat{Y}$ are randomized a second time to obtain randomized data $\tilde{X}$ and $\tilde{Y}$ for a server, and helper information $T_{\tilde{X}|\hat{X}}$ and $T_{\tilde{Y}|\hat{Y}}$ for a client, where $T$ represents an empirical distribution, and where the second randomizing preserves the privacy of the aggregate statistics of the data $X$ and $Y$. The server then determines statistics $T_{\tilde{X},\tilde{Y}}$. Last, the client applies the helper information $T_{\tilde{X}|\hat{X}}$ and $T_{\tilde{Y}|\hat{Y}}$ to $T_{\tilde{X},\tilde{Y}}$ to obtain an estimated $\dot{T}_{X,Y}$, wherein “|” and “,” between X and Y represent a conditional and joint distribution, respectively.

SUMMARY OF THE INVENTION

The embodiments of the invention provide a method for obtaining aggregate statistics from client data stored at a server, while preserving the privacy of data. The data are vertically partitioned.

The method uses a data release mechanism based on post randomization (PRAM), encryption and random sampling to maintain the privacy, while allowing the server to perform an accurate statistical analysis of the data. The encryption ensures that the server obtains no information about the client data, while the PRAM and sampling ensure individual privacy is maintained against third parties receiving the statistics.

Means are provided to characterize how much the composition of random sampling with PRAM increases the differential privacy compared to using PRAM alone. The statistical utility of the method is analyzed by bounding the estimation error, i.e., the expected l₂ norm error between a true empirical distribution and an estimated distribution, as a function of the number of samples, PRAM noise, and other parameters.

The analysis indicates a tradeoff between increasing PRAM noise versus decreasing the number of samples to maintain a desired level of privacy ε. An optimal number of samples m* that balances this tradeoff and maximizes the utility is determined.

Our mechanism maintains the privacy of client data stored and analyzed by a server. We achieve this with a privacy mechanism involving sampling and post randomization (PRAM), which is a generalization of randomized response. The mechanism prevents unauthorized parties from computing the joint or marginal empirical probability distributions of the data. We achieve this using random encrypting pads, which can only be reversed by authorized parties. The mechanism achieves a superior tradeoff between privacy and utility compared to simply performing PRAM on the data.

Sampling the client data enhances privacy with respect to the individual respondents, while retaining the utility provided to an authorized third party interested in the joint and marginal empirical probability distributions.

We enhance differential privacy by sampling data uniformly at a fixed rate without replacement. We formulate privacy using the definition of differential privacy wherein neighboring (adjacent) databases differ by a single replacement rather than by a single addition or deletion. Furthermore, we consider vertically partitioned distributed databases, which are maintained by mutually untrusting clients. To compute joint statistics, a join operation is required on the databases, which implies that individual clients cannot independently blend their respondents without altering the joint statistics across all databases.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a block diagram of private data from two client sources operated on according to embodiments of the invention;

FIG. 2 and FIG. 3 are schematics of a system and method according to embodiments of the invention for deriving statistics from the data of FIG. 1 by a third party without compromising privacy of the data;

FIG. 4 is a schematic of a release operation to recover the statistics; and

FIG. 5 is a graph of an expected l₂ norm error as a function of number of samples for three privacy levels.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 shows two example databases 101-102 maintained by two clients Alice and Bob. Alice and Bob can share a single client, or the data can be distributed over the one or more clients. The databases contain census-type information about individuals.

The databases are sanitized and combined 110, and stored at a server 120 so that an authorized third party can analyze, e.g., salaries in the population, while ensuring that the privacy of the individual respondents is maintained. It is understood that other data can be maintained by the clients.

Conceptually, a data release mechanism involves the “sanitization” of the data, via some form of perturbation or transform, to preserve individual privacy, before the data are made available to the server. The suitability of the method used to sanitize the data is determined by the extent to which rigorously defined privacy constraints are met.

The embodiments of our invention provide a method for obtaining aggregate statistics from the data stored at a server, while preserving the privacy of data.

Problem Formulation

In this section, we describe the general setup. The clients, Alice and Bob, wish to release their data to enable privacy-preserving data analysis by an authorized third party. For simplicity, we describe our problem formulation and results for two clients. However, our methods can be generalized to more than two clients.

It should also be understood that the third party can be an authorized third party accessing a third party processor. That is, the authorized third party directs the processor to perform certain functions. The third party also supplies the appropriate keys to the processor so that the processor can perform decryption as necessary. Therefore, the terms authorized third party and authorized third party processor as described herein are essentially synonymous. Those of ordinary skill in the arts would understand that the actual computations are performed by the processor under the direction of the third party.

Consider a data mining application in which Alice and Bob are mutually untrusting clients. The two databases are to be made available for third party analysis with authorization granted by the clients, such that statistical measures can be computed either on the individual databases, or on some combination of the two databases. The clients should have flexible access control over the data. For example, if the third party is granted access by Alice but not by Bob, then the third party can only access Alice's data. In addition, the server should only host the data and not be able to access the underlying information. Therefore, the data are sanitized, before being released, to protect individual privacy.

We have the following privacy and utility requirements.

Database Security:

Only third parties authorized by the clients can extract statistical information from the database.

Respondent Privacy:

Individual privacy of the respondents related to the data is maintained against the server as well as the third parties.

Statistical Utility:

The authorized third party, i.e., one possessing appropriate keys, can compute joint and marginal distributions of the data provided by the clients.

Complexity:

The overall communication and computation requirements of the system should be reasonable.

In the following sections, we describe our system framework and formalize the notions of privacy and utility.

Type and Matrix Notation

The type, or empirical distribution, of a data sequence $X^n$ is defined as a mapping $T_{X^n}: \mathcal{X} \to [0,1]$ given by

$\forall x \in \mathcal{X}, \quad T_{X^n}(x) := \frac{|\{i : X_i = x\}|}{n}.$

A joint type of two sequences $X^n$ and $Y^n$ is defined as a mapping $T_{X^n,Y^n}: \mathcal{X} \times \mathcal{Y} \to [0,1]$ given by

$\forall (x,y) \in \mathcal{X} \times \mathcal{Y}, \quad T_{X^n,Y^n}(x,y) := \frac{|\{i : (X_i,Y_i) = (x,y)\}|}{n}.$

For notational convenience, when working with finite-domain type/distribution functions, we drop the arguments to represent and use these functions as vectors and matrices. For example, we can represent a distribution function $P_X: \mathcal{X} \to [0,1]$ as the $|\mathcal{X}| \times 1$ column vector $P_X$, with values arranged according to a fixed consistent ordering of $\mathcal{X}$. Thus, with a slight abuse of notation, using the values of $\mathcal{X}$ to index the vector, the $x$-th element of the vector, $P_X[x]$, is given by $P_X(x)$. Similarly, a conditional distribution function $P_{Y|X}: \mathcal{Y} \times \mathcal{X} \to [0,1]$ can be represented as a $|\mathcal{Y}| \times |\mathcal{X}|$ matrix $P_{Y|X}$, defined by $P_{Y|X}[y,x] := P_{Y|X}(y|x)$. For example, by using this notation, the elementary probability identity

$\forall y \in \mathcal{Y}, \quad P_Y(y) = \sum_{x \in \mathcal{X}} P_{Y|X}(y|x) P_X(x),$

can be written in matrix form as $P_Y = P_{Y|X} P_X$.
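The type notation can be illustrated with a small sketch. The function names (`joint_type`, `marginal_x`, `conditional_y_given_x`) and the toy sequences are illustrative assumptions and do not appear in the specification; exact fractions are used so that the identity $P_Y = P_{Y|X} P_X$ can be checked without rounding.

```python
from collections import Counter
from fractions import Fraction


def joint_type(xs, ys):
    """Joint type T_{X^n,Y^n}: fraction of positions i with (X_i, Y_i) = (x, y)."""
    n = len(xs)
    counts = Counter(zip(xs, ys))
    return {pair: Fraction(c, n) for pair, c in counts.items()}


def marginal_x(joint):
    """Marginal type of X, obtained by summing the joint type over y."""
    out = {}
    for (x, _y), p in joint.items():
        out[x] = out.get(x, 0) + p
    return out


def conditional_y_given_x(joint):
    """Entries P_{Y|X}[y, x] of the conditional distribution matrix."""
    p_x = marginal_x(joint)
    return {(y, x): p / p_x[x] for (x, y), p in joint.items()}


# Hypothetical toy data: four respondents, one attribute held by each client.
xs = ["a", "a", "b", "b"]
ys = [0, 1, 1, 1]

t_xy = joint_type(xs, ys)
p_x = marginal_x(t_xy)
p_y_given_x = conditional_y_given_x(t_xy)

# Matrix identity P_Y = P_{Y|X} P_X, written as an explicit sum over x.
p_y = {}
for (y, x), p in p_y_given_x.items():
    p_y[y] = p_y.get(y, 0) + p * p_x[x]
```

Here `p_y` agrees with the type computed directly from `ys`, which is the content of the matrix identity.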

System Framework

FIG. 2 shows the system and method framework of embodiments of our invention.

Database Model:

The data 101 maintained by Alice are modeled as a sequence $X^n := (X_1, X_2, \ldots, X_n)$, with each $X_i$ taking values in a finite alphabet $\mathcal{X}$. Likewise, Bob's data 102 are modeled as a sequence of random variables $Y^n := (Y_1, Y_2, \ldots, Y_n)$, with each $Y_i$ taking values in a finite alphabet $\mathcal{Y}$. The length $n$ of the sequences represents the total number of respondents in the database, and each $(X_i, Y_i)$ pair represents the data of the respondent $i$ collectively maintained by Alice and Bob, with the alphabet $\mathcal{X} \times \mathcal{Y}$ representing the domain of the data of each respondent.

Data Processing and Release:

Each client applies a data release mechanism to its respective data to produce an encryption of the data for the server and a decryption key for the third party. These mechanisms are denoted by the randomized mappings $F_A: \mathcal{X}^n \to \mathcal{O}_A \times \mathcal{K}_A$ and $F_B: \mathcal{Y}^n \to \mathcal{O}_B \times \mathcal{K}_B$, where $\mathcal{K}_A$ 201 and $\mathcal{K}_B$ 202 are suitable key spaces, and $\mathcal{O}_B$ 203 and $\mathcal{O}_A$ 204 are suitable encryption spaces. The encryptions and keys are produced and given by

$(O_A, K_A) := F_A(X^n),$

$(O_B, K_B) := F_B(Y^n).$

The encryptions $O_A$ and $O_B$ are passed to the server, which performs processing, and the keys $K_A$ and $K_B$ are sent to the third party 225. The server processes $O_A$ and $O_B$, producing an output $O$ via a random mapping $M: \mathcal{O}_A \times \mathcal{O}_B \to \mathcal{O}$ 220, as given by

$O := M(O_A, O_B).$

Statistical Recovery:

To enable the recovery of the statistics of the database, the processed output $O$ is provided to the third party via the server, and the encryption keys $K_A$ and $K_B$ are provided to the third party by the clients. The third party produces an estimate of the joint type (empirical distribution) of Alice and Bob's data sequences, denoted by $\hat{T}_{X^n,Y^n}$ 250, as a function of $O$, $K_A$, and $K_B$, as given by

$\hat{T}_{X^n,Y^n} := g(O, K_A, K_B),$

where $g: \mathcal{O} \times \mathcal{K}_A \times \mathcal{K}_B \to [0,1]^{\mathcal{X} \times \mathcal{Y}}$ is a reconstruction function.

The object is to design a system within the above framework, by specifying the mappings $F_A$, $F_B$, $M$, and $g$, that optimizes the system performance requirements, which are formulated below.

Privacy and Utility Conditions

We now formulate the privacy and utility requirements for our framework.

Privacy Against the Server:

During the course of system operation, the clients do not want to reveal any information about their data to the server, not even aggregate statistics. A strong statistical condition that guarantees this security is the requirement of statistical independence between the data and the encrypted versions stored at the server. The statistical requirement of independence guarantees security even against an adversarial server with unbounded resources, and does not require any unproved assumptions.

Respondent Privacy:

The data should be kept private from all other parties, including any authorized third parties who attempt to recover the statistics. We formalize this notion using ε-differential privacy for the respondents as follows:

Given the above framework, the system provides ε-differential privacy if for all data $(x^n, y^n)$ and $(\dot{x}^n, \dot{y}^n)$ in $\mathcal{X}^n \times \mathcal{Y}^n$, within Hamming distance $d_H((x^n, y^n), (\dot{x}^n, \dot{y}^n)) \le 1$, and all $S \subseteq \mathcal{O} \times \mathcal{K}_A \times \mathcal{K}_B$,

$\Pr[(O, K_A, K_B) \in S \mid (X^n, Y^n) = (x^n, y^n)] \le e^{\varepsilon} \Pr[(O, K_A, K_B) \in S \mid (X^n, Y^n) = (\dot{x}^n, \dot{y}^n)].$

This rigorous definition of privacy satisfies the privacy axioms. Under the assumption that the data are i.i.d., this definition results in a strong privacy guarantee. An attacker with knowledge of all except one of the respondents cannot recover the data of the sole missing respondent.

Utility for Authorized Third Parties:

The utility of the estimate is measured by the expected l₂-norm error of this estimated type vector, given by

$E\|\hat{T}_{X^n,Y^n} - T_{X^n,Y^n}\|_2,$

with the goal being the minimization of this error.

System Complexity:

The communication and computational complexity of the system are also of concern. The computational complexity can be represented by the complexity of implementing the mappings ($F_A$, $F_B$, $M$ and $g$) that specify a given system. Ideally, we try to minimize the computational complexity of all of these mappings, and simplify the operations that each party performs. The communication requirements are given by the cardinalities of the symbol alphabets ($\mathcal{O}_A$, $\mathcal{O}_B$, $\mathcal{K}_A$, $\mathcal{K}_B$, and $\mathcal{O}$). The logarithms of these alphabet sizes indicate the sufficient length for the messages that must be transmitted in our system.

Proposed System and Analysis

We describe the details of our system, and analyze its privacy and utility performance. First, we describe how our system utilizes sampling and additive encryption, enabling the server to join and perturb encrypted data in order to facilitate the release of sanitized data to the third party. Next, we analyze the privacy of our system and show that our sampling enhances privacy, thereby reducing the amount of noise that is added during the perturbation step to obtain a desired level of privacy. Finally, we analyze the accuracy of the joint type reconstruction, producing a bound on the utility as a function of the system parameters, that is to say, the noise added during perturbation, and the sampling factor.

System Architecture

FIG. 2 shows the overall system, and the data sanitization and release procedure is outlined by the following steps:

1. Sampling: The clients respectively randomly sample 205 their data $X^n$ and $Y^n$, producing shortened sequences $\tilde{X}^m$ and $\tilde{Y}^m$.
2. Encryption: The clients encrypt 210 the shortened sequences before passing them to the server.
3. Perturbation: The server combines and perturbs 220 the encrypted sequences.
4. Release: The third party processor 225 obtains the sanitized data from the server and the encryption keys from the clients, allowing the approximate recovery of data statistics $T_{X,Y}$ 230.

It is assumed that the clients, server and third party each have at least one processor to perform the steps outlined above. The processors are connected to memories, e.g., databases, and input/output interfaces for communicating with each other.

A key aspect of these steps is that the encryption and perturbation schemes commute, thus allowing the server to essentially perform perturbation on the encrypted sequences, and for the authorized third party to subsequently decrypt the perturbed data. In this section, we describe the details of each step from a theoretical perspective by applying mathematical abstractions and assumptions. Later, we describe practical implementations toward realizing our system.

Another aspect of the invention is that the sequence of sampling, randomization and encryption can be altered so long as the encryption parameters are determined by the client processors and the decryption parameters are provided to the authorized third-party processor.

FIG. 3 shows the overall process using the abstractions.

Sampling:

The clients sample 205 the data to reduce the length of the database sequences $(X^n, Y^n)$ to $m$ randomly drawn samples, where $m$ is substantially smaller than $n$. The samples are drawn uniformly without replacement, and both clients sample at the same locations. We let

$(\tilde{X}^m, \tilde{Y}^m) := (\tilde{X}_1, \ldots, \tilde{X}_m, \tilde{Y}_1, \ldots, \tilde{Y}_m)$

denote the intermediate result after sampling. Mathematically, the sampling result is described by

$(\tilde{X}_i, \tilde{Y}_i) = (X_{I_i}, Y_{I_i})$ for all $i$ in $\{1, \ldots, m\}$,

where $I_1, \ldots, I_m$ are drawn uniformly without replacement from $\{1, \ldots, n\}$.
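The sampling step can be sketched as follows. How the two clients agree on common locations is not specified in this section; the sketch simply assumes a shared random generator (for example, seeded from a value both clients know), and the function name `sample_same_locations` is hypothetical.

```python
import random


def sample_same_locations(xs, ys, m, rng):
    """Draw m locations uniformly without replacement and apply them to both sequences.

    Assumption: both clients hold the same `rng` state, so they select
    identical locations I_1, ..., I_m without exchanging their data.
    """
    if len(xs) != len(ys):
        raise ValueError("both databases must cover the same n respondents")
    locations = rng.sample(range(len(xs)), m)  # distinct indices, uniform over orderings
    x_tilde = [xs[i] for i in locations]
    y_tilde = [ys[i] for i in locations]
    return x_tilde, y_tilde, locations


# Hypothetical toy databases with n = 6 respondents, sampled down to m = 3.
rng = random.Random(2012)
x_tilde, y_tilde, locations = sample_same_locations(
    ["a", "b", "c", "d", "e", "f"], [10, 20, 30, 40, 50, 60], 3, rng
)
```

Because the same locations index both sequences, each sampled pair still belongs to a single respondent, which is what the later join at the server relies on.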

Encryption:

The clients independently encrypt 210 the sampled data sequences with an additive (one-time pad) encryption scheme. Alice uses an independent uniform key sequence $V^m \in \mathcal{X}^m$ 301, and produces the encrypted sequence

$\check{X}^m := \tilde{X}^m \oplus V^m := (\tilde{X}_1 + V_1, \ldots, \tilde{X}_m + V_m),$

where ⊕ denotes addition applied element-by-element over the sequences. The addition operation can be any suitably defined group addition operation over the finite set.

Similarly, Bob encrypts his data, with the independent uniform key sequence $W^m \in \mathcal{Y}^m$ 302, to produce the encrypted sequence

$\check{Y}^m := \tilde{Y}^m \oplus W^m := (\tilde{Y}_1 + W_1, \ldots, \tilde{Y}_m + W_m).$

Alice and Bob send these encrypted sequences to the server, and provide the keys to the third party to enable data release.
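A minimal sketch of the additive one-time pad follows. It assumes, purely for illustration, that the alphabet has been mapped to the integers 0, ..., k-1 and that the group operation is addition modulo k; the text allows any suitable group addition. The function names are hypothetical.

```python
import secrets


def encrypt(sequence, alphabet_size):
    """Additive one-time pad: add an independent uniform key symbol to each sample."""
    key = [secrets.randbelow(alphabet_size) for _ in sequence]
    ciphertext = [(s + k) % alphabet_size for s, k in zip(sequence, key)]
    return ciphertext, key


def decrypt(ciphertext, key, alphabet_size):
    """Remove the pad by subtracting the key element-by-element."""
    return [(c - k) % alphabet_size for c, k in zip(ciphertext, key)]


# Hypothetical sampled sequence over an alphabet of size 4.
x_tilde = [0, 3, 1, 1, 2]
x_check, v_key = encrypt(x_tilde, 4)  # x_check goes to the server, v_key to the third party
```

Since each key symbol is uniform and used once, each ciphertext symbol is uniform on the alphabet regardless of the underlying sample.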

Perturbation:

The server joins the encrypted data sequences, forming $((\check{X}_1, \check{Y}_1), \ldots, (\check{X}_m, \check{Y}_m))$, and perturbs 220 the sequences by applying an independent PRAM mechanism, producing the perturbed results $(\bar{X}^m, \bar{Y}^m)$. Each joined and encrypted sample $(\check{X}_i, \check{Y}_i)$ is perturbed independently and identically according to a conditional distribution $P_{\bar{X},\bar{Y}|\check{X},\check{Y}}$ that specifies a random mapping from $(\mathcal{X} \times \mathcal{Y})$ to $(\mathcal{X} \times \mathcal{Y})$.

Using the matrix $A := P_{\bar{X},\bar{Y}|\check{X},\check{Y}}$ to represent the conditional distribution, this operation can be represented by

$P_{\bar{X}^m,\bar{Y}^m|\check{X}^m,\check{Y}^m}(\bar{x}^m, \bar{y}^m \mid \check{x}^m, \check{y}^m) = \prod_{i=1}^{m} A[(\bar{x}_i, \bar{y}_i), (\check{x}_i, \check{y}_i)].$

By design, we specify that $A$ is a γ-diagonal matrix, for a parameter $\gamma > 1$, given by

$A[(\bar{x}, \bar{y}), (\check{x}, \check{y})] := \begin{cases} \gamma/q, & \text{if } (\bar{x}, \bar{y}) = (\check{x}, \check{y}), \\ 1/q, & \text{otherwise}, \end{cases}$

where $q := \gamma + |\mathcal{X}||\mathcal{Y}| - 1$ is a normalizing constant.
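The γ-diagonal PRAM can be sketched as below. The helper names (`gamma_diagonal_matrix`, `pram_symbol`) and the binary toy alphabet are assumptions made for illustration. Keeping a symbol with probability γ/q and otherwise choosing uniformly among the remaining symbols gives every off-diagonal entry probability 1/q, as in the definition of $A$.

```python
import random
from fractions import Fraction


def gamma_diagonal_matrix(alphabet, gamma):
    """Entries A[(output, input)] of the gamma-diagonal PRAM matrix."""
    q = gamma + len(alphabet) - 1
    return {(out, inp): (gamma if out == inp else 1) / q
            for out in alphabet for inp in alphabet}


def pram_symbol(symbol, alphabet, gamma, rng):
    """Perturb one joined sample according to the column of A indexed by `symbol`."""
    q = gamma + len(alphabet) - 1
    if rng.random() < gamma / q:
        return symbol  # kept unchanged with probability gamma / q
    others = [s for s in alphabet if s != symbol]
    return rng.choice(others)  # each other symbol has probability 1 / q


# Hypothetical joint alphabet X x Y with |X| = |Y| = 2.
alphabet = [(x, y) for x in (0, 1) for y in (0, 1)]
gamma = Fraction(3)
A = gamma_diagonal_matrix(alphabet, gamma)

rng = random.Random(0)
encrypted_pairs = [(0, 1), (1, 1), (0, 0)]
perturbed_pairs = [pram_symbol(p, alphabet, gamma, rng) for p in encrypted_pairs]
```

Each column of `A` sums to one, so it is a valid conditional distribution for every input symbol.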

Release:

In order to recover the statistics as shown in FIG. 4, the third party obtains the sampled, encrypted, and perturbed data sequences $(\bar{X}^m, \bar{Y}^m)$ 401 from the server, and the encryption keys, $V^m$ and $W^m$, from the clients. The third party decrypts 410 and recovers the sanitized data given by

$(\hat{X}^m, \hat{Y}^m) := (\bar{X}^m \ominus V^m, \bar{Y}^m \ominus W^m),$

where ⊖ denotes element-by-element subtraction of the key, i.e., the inverse of the ⊕ operation used for encryption. The result is effectively the data sanitized by the sampling and the PRAM. The third party produces the joint type estimate by inverting the matrix $A$ and multiplying the inverse with the joint type of the sanitized data as follows

$\hat{T}_{X^n,Y^n} := A^{-1} T_{\hat{X}^m,\hat{Y}^m}.$

Due to the γ-diagonal property of the matrix $A$, the PRAM perturbation is essentially an additive operation that commutes with the additive encryption. This allows the server to perturb the encrypted data while preserving the perturbation when the encryption is removed. The decrypted, sanitized data $(\hat{X}^m, \hat{Y}^m)$ recovered by the third party are essentially the sampled data perturbed by the PRAM.
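The estimation step can be sketched without a general matrix inverse. Since $A = ((\gamma - 1)I + J)/q$, where $J$ is the all-ones matrix, the inverse has a closed form; this closed form is a standard linear-algebra observation added here for illustration and is not stated in the text. The function names and the toy data are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def joint_type_vector(pairs, alphabet):
    """Joint type of a list of (x, y) pairs, ordered as in `alphabet`."""
    counts = Counter(pairs)
    return [Fraction(counts[a], len(pairs)) for a in alphabet]


def apply_A(t, gamma):
    """Compute A t, using (A t)[i] = ((gamma - 1) t[i] + sum(t)) / q."""
    q = gamma + len(t) - 1
    total = sum(t)
    return [((gamma - 1) * v + total) / q for v in t]


def apply_A_inverse(s, gamma):
    """Compute A^{-1} s, using (A^{-1} s)[i] = (q s[i] - sum(s)) / (gamma - 1)."""
    q = gamma + len(s) - 1
    total = sum(s)
    return [(q * v - total) / (gamma - 1) for v in s]


alphabet = [(x, y) for x in (0, 1) for y in (0, 1)]
gamma = Fraction(3)

# Round trip on a hypothetical true type: A^{-1} (A T) recovers T exactly.
true_type = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4), Fraction(0)]
recovered = apply_A_inverse(apply_A(true_type, gamma), gamma)

# Estimate from hypothetical decrypted, sanitized pairs held by the third party.
sanitized_pairs = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (0, 0)]
estimate = apply_A_inverse(joint_type_vector(sanitized_pairs, alphabet), gamma)
```

The estimate always sums to one, but for small $m$ individual entries can fall outside $[0,1]$, because $A^{-1}$ is applied to a noisy empirical type rather than to its expectation.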

Sampling Enhances Privacy

We analyze the privacy of our system. Specifically, we show how sampling in conjunction with PRAM enhances the overall privacy for the respondents in comparison to using PRAM alone. Note that if PRAM, with the γ-diagonal matrix $A$, was applied alone to the full databases, the resulting perturbed data would have ln(γ)-differential privacy. In the following, we show that the combination of sampling and PRAM results in sampled and perturbed data with enhanced privacy. Our system provides ε-differential privacy for the respondents, where

$\varepsilon = \ln\left(\frac{n + m(\gamma - 1)}{n}\right). \qquad (1)$

We use the following notation for the set of all possible samplings

$\Theta := \{\pi \mid \pi := (\pi_1, \ldots, \pi_m) \in \{1, \ldots, n\}^m, \ \pi_i \ne \pi_j, \ \forall i \ne j\}.$

The sampling locations $(I_1, \ldots, I_m)$ are uniformly drawn from the set $\Theta$. We define $\Theta_k := \{\pi \in \Theta \mid \exists i, \pi_i = k\}$ to denote the subset of samplings that select location $k$, and $\Theta_k^c := \Theta \setminus \Theta_k$ to denote the subset of samplings that do not select location $k$. For $\pi \in \Theta_k$, we define $\Theta_k(\pi) := \{\pi' \in \Theta_k^c \mid d_H(\pi, \pi') = 1\}$ as the subset of $\Theta_k^c$ that replaces the selection of location $k$ with any other non-selected location. We denote $\pi \in \Theta$ as a sampling function for the data sequences, that is, $\pi(X^n) := (X_{\pi_1}, \ldots, X_{\pi_m})$, and similarly for $\pi(Y^n)$. Using the above notation, we can rewrite the following conditional probability,

$\begin{aligned} P_{\hat{X}^m,\hat{Y}^m|X^n,Y^n}(\hat{x}^m, \hat{y}^m \mid x^n, y^n) &= \sum_{\pi \in \Theta} \frac{1}{|\Theta|} P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(x^n), \pi(y^n)) \\ &= \frac{1}{|\Theta|} \left[ \sum_{\pi \in \Theta_k} P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(x^n), \pi(y^n)) + \sum_{\pi' \in \Theta_k^c} P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi'(x^n), \pi'(y^n)) \right] \\ &= \frac{1}{|\Theta|} \sum_{\pi \in \Theta_k} \left[ P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(x^n), \pi(y^n)) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi)} P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi'(x^n), \pi'(y^n)) \right], \end{aligned}$

where in the last equality we have rearranged the summations to embed the summation over $\pi' \in \Theta_k^c$ into the summation over $\pi \in \Theta_k$.

Note that summing over all $\pi' \in \Theta_k(\pi)$ within a summation over all $\pi \in \Theta_k$ covers all $\pi' \in \Theta_k^c$, but overcounts each $\pi'$ exactly $m$ times because each $\pi' \in \Theta_k^c$ belongs to $m$ of the $\Theta_k(\pi)$ sets across all $\pi \in \Theta_k$. Hence, a $(1/m)$ term has been added to account for this overcount.

For the above expansion, we use the following shorthand notation for the summation terms:

$\alpha(\pi) := P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(x^n), \pi(y^n))$

$\beta(\pi) := P_{\hat{X}^m,\hat{Y}^m|\tilde{X}^m,\tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(\dot{x}^n), \pi(\dot{y}^n))$

Thus, the probability ratio can be written as

$\frac{P_{\hat{X}^m,\hat{Y}^m|X^n,Y^n}(\hat{x}^m, \hat{y}^m \mid x^n, y^n)}{P_{\hat{X}^m,\hat{Y}^m|X^n,Y^n}(\hat{x}^m, \hat{y}^m \mid \dot{x}^n, \dot{y}^n)} = \frac{\sum_{\pi \in \Theta_k} \left( \alpha(\pi) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi)} \alpha(\pi') \right)}{\sum_{\pi \in \Theta_k} \left( \beta(\pi) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi)} \beta(\pi') \right)} \le \max_{\pi \in \Theta_k} \frac{\alpha(\pi) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi)} \alpha(\pi')}{\beta(\pi) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi)} \beta(\pi')} = \frac{\alpha(\pi^*) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi^*)} \alpha(\pi')}{\beta(\pi^*) + \frac{1}{m} \sum_{\pi' \in \Theta_k(\pi^*)} \beta(\pi')},$

where π*εΘ_(k) denotes the sampling that maximizes the ratio. Given theγ-diagonal structure of the matrix A, we have that

γ⁻¹α(π*)≦β(π*),

because (π*(x^(n)),π*(y^(n))) and (π*({dot over (x)}^(n)),π*({dot over(y)}^(n))) differ in only one location,

γ⁻¹α(π*)≦α(π′),∀π′εΘ_(k)(π*),

because (π*(x_(n)),π*(y^(n))) and (π′(x^(n)),π′(y^(n))) differ in onlyone location, and

α(π′)=β(π′),∀π′εΘ_(k)(π*),

because (π′(x^(n)),π′(y^(n)))=(π′({dot over (x)}^(n)),π′({dot over(y)}^(n))). Given these constraints, we can bound the likelihood ratioas

${{\frac{P_{{\hat{X}}^{m},{{\hat{Y}}^{m}|X^{n}},Y^{n}}\left( {{\hat{x}}^{m},\left. {\hat{y}}^{m} \middle| x^{n} \right.,y^{n}} \right)}{P_{{\hat{X}}^{m},{{\hat{Y}}^{m}|X^{n}},Y^{n}}\left( {{\hat{x}}^{m},\left. {\hat{y}}^{m} \middle| {\overset{.}{x}}^{n} \right.,{\overset{.}{y}}^{n}} \right)} \leq \frac{{\alpha \left( \pi^{*} \right)} + {\frac{1}{m}{\sum\limits_{\pi^{\prime} \in {\Theta_{k}{(\pi^{*})}}}{\alpha \left( \pi^{\prime} \right)}}}}{{\beta \left( \pi^{*} \right)} + {\frac{1}{m}{\sum\limits_{\pi^{\prime} \in {\Theta_{k}{(\pi^{*})}}}{\beta \left( \pi^{\prime} \right)}}}} \leq \frac{{\alpha \left( \pi^{*} \right)} + {\frac{n - m}{m}\gamma^{- 1}{\alpha \left( \pi^{*} \right)}}}{{\gamma^{- 1}{\alpha \left( \pi^{*} \right)}} + {\frac{n - m}{m}\gamma^{- 1}{\alpha \left( \pi^{*} \right)}}}} = {\frac{n + {m\left( {\gamma - 1} \right)}}{n} = e^{\varepsilon}}},$

which bounds the likelihood ratio by e^(ε).

To show ε-differential privacy for a given ε, we only need to upper bound the probability ratio by e^(ε), as done above. A natural question is whether this bound is tight, that is, whether there exists a smaller ε for which the bound also holds, which would make the system more private. The following example shows that the value for ε is tight.

Example

Let a and b be two distinct elements in (X×Y). Let (x^(n),y^(n))=(b, a, a, . . . , a), ({dot over (x)}^(n),{dot over (y)}^(n))=(a, a, . . . , a) and ({circumflex over (x)}^(m),ŷ^(m))=(b, b, . . . , b). Let E denote the event that the first element (where the two databases differ) is sampled, which occurs with probability (m/n). We can determine the likelihood ratio as follows

$\frac{P_{{\hat{X}}^{m},{{\hat{Y}}^{m}|X^{n}},Y^{n}}\left( {{\hat{x}}^{m},\left. {\hat{y}}^{m} \middle| x^{n} \right.,y^{n}} \right)}{P_{{\hat{X}}^{m},{{\hat{Y}}^{m}|X^{n}},Y^{n}}\left( {{\hat{x}}^{m},\left. {\hat{y}}^{m} \middle| {\overset{.}{x}}^{n} \right.,{\overset{.}{y}}^{n}} \right)} = {\frac{{{\Pr \lbrack E\rbrack}{\gamma \left( {1/q} \right)}^{m}} + {\left( {1 - {\Pr \lbrack E\rbrack}} \right)\left( {1/q} \right)^{m}}}{\left( {1/q} \right)^{m}} = {\frac{n + {m\left( {\gamma - 1} \right)}}{n} = {e^{\varepsilon}.}}}$

Thus, the value of ε is indeed tight.
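As a non-limiting illustration, the tightness example can be checked by exact enumeration on a small instance. The sketch below assumes that the γ-diagonal matrix A has diagonal entries γ/q and off-diagonal entries 1/q with q = γ + |X||Y| − 1, and that the sampled elements keep their database order; the function name `output_likelihood` and the chosen sizes are hypothetical. Because the output in the example is constant, the ordering assumption does not affect the ratio.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def output_likelihood(database, output, gamma, alphabet_size):
    """Exact P(output | database) for sampling m of n elements without
    replacement, followed by independent PRAM with a gamma-diagonal matrix."""
    gamma = Fraction(gamma)
    q = gamma + alphabet_size - 1  # assumed normalization of the matrix A
    n, m = len(database), len(output)
    total = Fraction(0)
    for rows in combinations(range(n), m):  # every sample set is equally likely
        p = Fraction(1)
        for out_sym, row in zip(output, rows):
            weight = gamma if out_sym == database[row] else Fraction(1)
            p *= weight / q
        total += p
    return total / comb(n, m)


n, m, gamma, alphabet_size = 5, 2, 3, 4
x = ("b",) + ("a",) * (n - 1)   # database with b in the first position
x_dot = ("a",) * n              # neighboring database
out = ("b",) * m                # observed output
ratio = (output_likelihood(x, out, gamma, alphabet_size)
         / output_likelihood(x_dot, out, gamma, alphabet_size))
print(ratio)  # 9/5, which equals (n + m*(gamma - 1)) / n
```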

As a consequence of the privacy analysis, for given system parameters of database size n, number of samples m, and desired privacy level ε, the level of PRAM perturbation, specified by the γ parameter of the matrix A, is

$\begin{matrix}{\gamma = {1 + {\frac{n}{m}{\left( {e^{\varepsilon} - 1} \right).}}}} & (2)\end{matrix}$
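By way of example only, Equation (2) and its inverse relation e^(ε) = (n + m(γ−1))/n from the privacy bound above can be evaluated as follows. The function names are hypothetical and the numerical values are chosen purely for illustration.

```python
import math


def gamma_for_privacy(n, m, epsilon):
    """PRAM parameter gamma that gives privacy level epsilon, per Equation (2)."""
    return 1.0 + (n / m) * math.expm1(epsilon)  # expm1(e) = exp(e) - 1


def epsilon_for_gamma(n, m, gamma):
    """Privacy level obtained with a given gamma, from the likelihood-ratio bound."""
    return math.log((n + m * (gamma - 1.0)) / n)


# Illustrative values: 1000 records, 100 samples, epsilon = ln 2.
g = gamma_for_privacy(1000, 100, math.log(2))
print(g, epsilon_for_gamma(1000, 100, g))
```

Sampling fewer records (smaller m) permits a larger γ, that is, less perturbation, at the same privacy level.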

Privacy against the server is obtained as a consequence of the one-time-pad encryption 210 performed on the data before the transmission to the server. The encryptions received by the server are statistically independent of the original database as a consequence of the independence and uniform randomness of the keys.
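As a non-limiting sketch of why uniformly random keys make the encryptions independent of the data, the following example uses a one-time pad realized as addition modulo the alphabet size. The modular-addition form, the encoding of symbols as integers 0, . . . , K−1, and all function names are assumptions made for illustration and are not taken from the description of step 210.

```python
import secrets
from collections import Counter


def otp_encrypt(symbols, keys, alphabet_size):
    """Encrypt each symbol by adding its key modulo the alphabet size."""
    return [(s + k) % alphabet_size for s, k in zip(symbols, keys)]


def otp_decrypt(ciphertext, keys, alphabet_size):
    """Recover each symbol by subtracting its key modulo the alphabet size."""
    return [(c - k) % alphabet_size for c, k in zip(ciphertext, keys)]


def ciphertext_distribution(symbol, alphabet_size):
    """Exact ciphertext counts for one symbol when the key is uniform."""
    return Counter((symbol + k) % alphabet_size for k in range(alphabet_size))


K = 4
data = [0, 3, 1, 1, 2]
keys = [secrets.randbelow(K) for _ in data]  # one independent uniform key per symbol
cipher = otp_encrypt(data, keys, K)
print(otp_decrypt(cipher, keys, K) == data)
print(ciphertext_distribution(0, K) == ciphertext_distribution(3, K))
```

Every plaintext symbol induces the same uniform ciphertext distribution, so a party that holds only the ciphertext learns nothing about the symbol.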

Utility Analysis

In this subsection, we analyze the utility of our system. Our main result is a theoretical bound on the expected l₂-norm of the joint type estimation error. Analysis of this bound indicates the tradeoffs between utility and privacy level ε as a function of the sampling parameter m and the PRAM perturbation level γ. Given this error bound, we can compute the optimal sampling parameter m for minimizing the error bound while achieving a fixed privacy level ε.

For our system, the expected l₂-norm of the joint type estimation error is bounded by

$\begin{matrix}{{E{{{\hat{T}}_{X^{n},Y^{n}} - T_{X^{n},Y^{n}}}}_{2}} \leq {\frac{{c\sqrt{{X}{Y}}} + 1}{\sqrt{m}}.}} & (3)\end{matrix}$

where c is the condition number of the γ-diagonal matrix A, given by

$c = {1 + {\frac{{X}{Y}}{\gamma - 1}.}}$

Applying the triangle inequality, we can bound the error as the sum of the error introduced by sampling and the error introduced by PRAM, as follows,

$E\left\| A^{-1}T_{{\hat{X}}^{m},{\hat{Y}}^{m}} - T_{X^{n},Y^{n}} \right\|_{2} \leq E\left\| T_{{\tilde{X}}^{m},{\tilde{Y}}^{m}} - T_{X^{n},Y^{n}} \right\|_{2} + E\left\| A^{-1}T_{{\hat{X}}^{m},{\hat{Y}}^{m}} - T_{{\tilde{X}}^{m},{\tilde{Y}}^{m}} \right\|_{2}.$

We analyze and bound the sampling error,

$E\left\| T_{{\tilde{X}}^{m},{\tilde{Y}}^{m}} - T_{X^{n},Y^{n}} \right\|_{2},$

by bounding the conditional expectation

$E\left\lbrack \left\| T_{{\tilde{X}}^{m},{\tilde{Y}}^{m}} - T_{X^{n},Y^{n}} \right\|_{2} \middle| T_{X^{n},Y^{n}} \right\rbrack.$

For a given (x,y) ∈ X×Y, the sampled type, T_({tilde over (X)}^(m),{tilde over (Y)}^(m))(x,y), conditioned on T_(X^(n),Y^(n)), is a hypergeometric random variable normalized by m, with expectation and variance given by

$\mspace{20mu} {{{E\left\lbrack {T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}\left( {x,y} \right)} \middle| T_{X^{n},Y^{n}} \right\rbrack} = {T_{X^{n},Y^{n}}\left( {x,y} \right)}},{{{Var}\left\lbrack {T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}\left( {x,y} \right)} \middle| T_{X^{n},Y^{n}} \right\rbrack} = {\frac{n\; {T_{X^{n},Y^{n}}\left( {x,y} \right)}\left( {n - {n\; {T_{X^{n},Y^{n}}\left( {x,y} \right)}}} \right)\left( {n - m} \right)}{{mn}^{2}\left( {n - 1} \right)} \leq {\frac{T_{X^{n},Y^{n}}\left( {x,y} \right)}{m}.}}}}$

Applying Jensen's inequality to the conditioned sampling error yields

${E\left\lbrack {{T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} - T_{X^{n},Y^{n}}}}_{2} \middle| T_{X^{n},Y^{n}} \right\rbrack} \leq \sqrt{\sum\limits_{{({x,y})} \in { \times }}{{Var}\left\lbrack {T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}\left( {x,y} \right)} \middle| T_{X^{n},Y^{n}} \right\rbrack}} \leq {\frac{1}{\sqrt{m}}.}$

Applying the smoothing theorem, the sampling error can be bounded by

${E{{T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} - T_{X^{n},Y^{n}}}}_{2}} = {{E_{T_{X^{n},Y^{n}}}\left\lbrack {E\left\lbrack {{T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} - T_{X^{n},Y^{n}}}}_{2} \middle| T_{X^{n},Y^{n}} \right\rbrack} \right\rbrack} \leq {\frac{1}{\sqrt{m}}.}}$

Next, to analyze and bound the PRAM error given by

$E\left\| A^{-1}T_{{\hat{X}}^{m},{\hat{Y}}^{m}} - T_{{\tilde{X}}^{m},{\tilde{Y}}^{m}} \right\|_{2},$

we use the following result from linear algebra.

Let A be an invertible matrix and (x,y) be vectors that satisfy Ax=y. For any vectors ({circumflex over (x)}, ŷ) such that {circumflex over (x)}=A⁻¹ŷ, we have

${\frac{{\hat{x} - x}}{x} \leq {c\frac{{\hat{y} - y}}{y}}},$

where c is the condition number of the matrix A.

To bound the PRAM error, we use the following consequence,

${{{{A^{- 1}T_{{\hat{X}}^{m},{\hat{Y}}^{m}}} - T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}_{2} \leq {c\frac{{T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}_{2}}{{{A\; T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}_{2}}{{T_{{\hat{X}}^{m},{\hat{Y}}^{m}} - {A\; T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}}_{2}}},$

which allows us to bound the conditional expectation of the PRAM error as follows,

${E\left\lbrack {{{A^{- 1}T_{{\hat{X}}^{m},{\hat{Y}}^{m}}} - T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}_{2} \middle| T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} \right\rbrack} \leq {c\frac{{T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}_{2}}{{{A\; T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}_{2}}{{E\left\lbrack {{T_{{\hat{X}}^{m},{\hat{Y}}^{m}} - {A\; T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}}_{2} \middle| T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} \right\rbrack}.}}$

For (x,y) ∈ X×Y, the perturbed and sampled type T_({circumflex over (X)}^(m),Ŷ^(m))(x,y), conditioned on T_({tilde over (X)}^(m),{tilde over (Y)}^(m)), is a Poisson-binomial random variable normalized by m, with expectation and variance given by

$\mspace{20mu} {{{E\left\lbrack {T_{{\hat{X}}^{m},{\hat{Y}}^{m}}\left( {x,y} \right)} \middle| T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} \right\rbrack} = {\left( {A\; T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}} \right)\left\lbrack {x,y} \right\rbrack}},{{{Var}\left\lbrack {T_{{\hat{X}}^{m},{\hat{Y}}^{m}}\left( {x,y} \right)} \middle| T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} \right\rbrack} = {\frac{1}{m}{\sum\limits_{{({x^{\prime},y^{\prime}})} \in { \times }}{{T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}\left( {x^{\prime},y^{\prime}} \right)}{A\left\lbrack {\left( {x,y} \right),\left( {x^{\prime},y^{\prime}} \right)} \right\rbrack}{\left( {1 - {A\left\lbrack {\left( {x,y} \right),\left( {x^{\prime},y^{\prime}} \right)} \right\rbrack}} \right).}}}}}}$

We can bound the following conditional expectation using Jensen's inequality to yield

${{E\left\lbrack {{T_{{\hat{X}}^{m},{\hat{Y}}^{m}} - {A\; T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}}_{2} \middle| T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} \right\rbrack} \leq \sqrt{\sum\limits_{{({x,y})} \in { \times }}{{Var}\left\lbrack {T_{{\hat{X}}^{m},{\hat{Y}}^{m}}\left( {x,y} \right)} \middle| T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} \right\rbrack}}} = {\quad{{\begin{bmatrix}{\frac{1}{m}{\sum\limits_{{({x,y})} \in { \times }}{\sum\limits_{{({x^{\prime},y^{\prime}})} \in { \times }}{T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}\left( {x^{\prime},y^{\prime}} \right)}}}} \\{{A\left\lbrack {\left( {x,y} \right),\left( {x^{\prime},y^{\prime}} \right)} \right\rbrack}\left( {1 - {A\left\lbrack {\left( {x,y} \right),\left( {x^{\prime},y^{\prime}} \right)} \right\rbrack}} \right)}\end{bmatrix}^{1/2} \leq \left\lbrack {\frac{1}{m}{\sum\limits_{{({x^{\prime},y^{\prime}})} \in { \times }}{{T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}\left( {x^{\prime},y^{\prime}} \right)}{\sum\limits_{{({x,y})} \in { \times }}{A\left\lbrack {\left( {x,y} \right),\left( {x^{\prime},y^{\prime}} \right)} \right\rbrack}}}}} \right\rbrack^{1/2}} = {\frac{1}{\sqrt{m}}.}}}$

Combining equations yields the bound

$\begin{matrix}{{{{E\left\lbrack {{{A^{- 1}T_{{\hat{X}}^{m},{\hat{Y}}^{m}}} - T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}_{2} \middle| T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} \right\rbrack} \leq {\frac{c}{\sqrt{m}}\frac{{T_{{\hat{X}}^{m},{\hat{Y}}^{m}}}_{2}}{{{A\; T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}_{2}}} \leq {\frac{c}{\sqrt{m}}\frac{\sqrt{{}{}}{T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}_{1}}{{{A\; T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}_{1}}}} = {c\sqrt{\frac{{}{}}{m}}}},} & (4)\end{matrix}$

which, upon applying the smoothing theorem, yields the following bound on the PRAM error

${E\left\lbrack {{{A^{- 1}T_{{\hat{X}}^{m},{\hat{Y}}^{m}}} - T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}_{2} \right\rbrack} = {{E_{T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}\left\lbrack {E\left\lbrack {{{A^{- 1}T_{{\hat{X}}^{m},{\hat{Y}}^{m}}} - T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}_{2} \middle| T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}} \right\rbrack} \right\rbrack} \leq {c{\sqrt{\frac{{}{}}{m}}.}}}$

Combining the individual bounds on the sampling and PRAM errors via the triangle inequality yields the following bound on the expected l₂-norm error of the type estimate formed from the sampled and perturbed data.

${E{{{A^{- 1}T_{{\hat{X}}^{m},{\hat{Y}}^{m}}} - T_{{\overset{\sim}{X}}^{m},{\overset{\sim}{Y}}^{m}}}}_{2}} \leq {\frac{{c\sqrt{{}{}}} + 1}{\sqrt{m}}.}$

Because A is a γ-diagonal matrix, its condition number c is given by

$c = {1 + {\frac{{X}{Y}}{\gamma - 1}.}}$

Optimal Sampling Parameter m*

Given a fixed PRAM perturbation parameter γ, the error bound decays on the order of O(1/√m) as a function of the sampling parameter m. However, as m increases, ε as given in Equation (1) also increases, decreasing privacy. If we instead fix the overall privacy level ε by adjusting γ as a function of m, as given by Equation (2), we observe that increasing m too much causes the error bound to grow, see FIG. 5.

Intuitively, this can be explained as follows. When m is too large, we need to increase the PRAM perturbation by lowering γ to maintain the same level of privacy, which has the adverse effect of increasing the error bound through the condition number c.

On the other hand, when m is too small, too few samples are taken, resulting in an inaccurate type estimate. This balance in adjusting the sampling parameter m shows that there is an optimal sample size m* as a function of the desired privacy level ε and the other system parameters, see FIG. 5.

The theoretically optimal sampling parameter m* that minimizes the error bound of Equation (3) is

$\begin{matrix}{m^{*} = {\frac{{n\left( {1 + \sqrt{{}{}}} \right)}\left( {e^{ɛ} - 1} \right)}{\left( {{}{}} \right)^{\frac{3}{2}}}.}} & (6)\end{matrix}$

FIG. 5 shows experimental results for three privacy levels (ε) as a function of the number of samples m. Each data point represents the expected l₂ norm of the type error, estimated as the empirical mean over 1000 independent experiments with synthetic data with a “uniform” distribution. The three vertical lines indicate the theoretically optimal number of samples at each respective privacy level.

Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

We claim:
 1. A method for securely determining aggregate statistics on private data, comprising the steps of: sampling, at one or more clients, data X^(n) and Y^(n) to obtain sampled data {tilde over (X)}^(m) and {tilde over (Y)}^(m), wherein m is a sampling parameter substantially smaller than a length n of the data; encrypting the sampled data {tilde over (X)}^(m) and {tilde over (Y)}^(m) to obtain encrypted data {hacek over (X)}^(m) and {hacek over (Y)}^(m); combining the encrypted data {hacek over (X)}^(m) and {hacek over (Y)}^(m) to obtain combined encrypted data; randomizing the combined encrypted data to obtain randomized data X ^(m), Y ^(m); estimating, at an authorized third-party processor, a joint distribution {circumflex over (T)}_(X^(n),Y^(n)) of the data X^(n) and Y^(n) from the randomized encrypted data X ^(m), Y ^(m), such that a differential privacy requirement on the data X^(n) and Y^(n) is satisfied.
 2. The method of claim 1, wherein the sampling and encrypting of the data X^(n) and Y^(n) is performed by the one or more client processors, the combining and randomizing are performed by the server processor, and the estimating is performed by the authorized third-party processor.
 3. The method of claim 1, wherein the encryption is performed before the sampling at the client processor.
 4. The method of claim 1, wherein the randomizing is performed by the client processor.
 5. The method of claim 1, wherein the encrypted data {hacek over (X)}^(m) and {hacek over (Y)}^(m) is obtained from the sampled data {tilde over (X)}^(m) and {tilde over (Y)}^(m) using a stream cipher, and decryption parameters for the stream cipher are provided to the authorized third-party processor.
 6. The method of claim 1, wherein the randomized data X ^(m), Y ^(m) is obtained from the encrypted data {hacek over (X)}^(m) and {hacek over (Y)}^(m) using a randomized response mechanism.
 7. The method of claim 5, wherein the estimating further comprises: reversing, at the authorized third-party processor, the encryption applied to the randomized data X ^(m), Y ^(m), using the decryption parameters provided to the authorized third-party processor by the one or more client processors.
 8. The method of claim 1, wherein the one or more client processors perform the sampling, encrypting and randomizing, and the third party processor performs the combining and estimating.
 9. The method of claim 1, wherein the sampling, randomizing and encrypting can be in any order as long as the encrypting parameters are determined by one or more client processors, and the decrypting parameters are provided to the authorized third-party processor.
 10. A method for securely determining aggregate statistics on private data, comprising the steps of: sampling, at one or more client processors, first data and second data to obtain sampled data, wherein a sampling parameter is substantially smaller than a length of the data; encrypting the sampled data to obtain encrypted data; combining the encrypted data to obtain combined encrypted data; randomizing the combined encrypted data to obtain randomized data; estimating, at an authorized third-party processor, a joint distribution of the first data and the second data from the randomized encrypted data such that a differential privacy requirement of the first data and the second data is satisfied.
 11. The method of claim 1, wherein the randomized data is obtained for the encrypted data using a randomized response mechanism wherein the randomized response mechanism further comprises: independently altering each data element to any other values in the alphabet set with the same probability, or alternatively retaining the value of that element.
 12. The method of claim 1, wherein one or more client processors perform the encrypting, and the server processor performs the combining, sampling, and randomizing.